Simple models for scaling in phylogenetic trees 
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Abstract 



]~~"iManv processes and models -in biological, phys- 
^ ical, social, and other contexts- produce trees 
whose depth scales logarithmically with the 
Q number of leaves. Phylogenetic trees, describ- 
•pl ing the evolutionary relationships between bio- 
I logical species, are examples of trees for which 
I ^ such scaling is not observed. With this moti- 
vation, we analyze numerically two branching 
models leading to non-logarithmic scaling of the 
depth with the number of leaves. For Ford's 
r alpha model, although a power-law scaling of 
^ the depth with tree size was established analyt- 
ic ically, our numerical results illustrate that the 
^ asymptotic regime is approached only at very 
OO large tree sizes. We introduce here a new model, 
jthe activity model, showing analytically and nu- 
. ^ merically that it also displays a power-law scal- 
^ ing of the depth with tree size at a critical pa- 
^ rameter value. 
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Although most 
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modern studies on com- 



Boccaletti et all 



■ Albert & BarabasiL 120021: 



2006ll consider situations 
in which nodes are connected by multiple 
paths, the case of trees, i.e. graphs without 
closed cycles, is relevant to describe many 
natural and artificial systems. Branching 
in re al trees [S tevens, 1974], in blood ves- 
sels iWestetal.1. 119971). in river networks 



llRodriguez-Iturbe & Rinaldol. 1199711 or in 



comp uter file systems iKlemm et all I2005L 
2006ll produce complex tree patterns worth 



to be described and understood. Trees are 
the outcome of classifications algorithms 

and of branching pro- 
and they also arise 



1988tl 



1963] 



llJain & DubesL 
cesses jHarrisL 
when compu t ing community structure 
BGuimera et al.L 120031] or as a backbone (for 
example a minimum spanning tre e) fo r more 

120031: 



connected networks LGarlaschelli et al 



20071: iRozenfeld et al.l 



Hernan dez-Garcia et al. 
[2008l] . 

Evolutionary processes leading to speciation 
are also sum marized in phylogenetic trees 
iCracraft & Do noghue. 2004]. In these trees the 
leaves represent living species and each inter- 
nal node represents a branching event in which 
an ancestral species diversified into daughter 
species. Every internal node is thus the root of 
its associated subtree which consists of all its de- 
scendant nodes. Phylogenetic tree topology en- 
codes information on evolutionary mechanisms 
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earliest mathematical model of evo- 



2008]. 
The 

lution ary branching was proposed by lYule 
1 1925l] . Apart from the distinction he intro- 
duced between genera and species diversifi- 
cation, the model is equivalent to the Equal 
Rates Markov (E RM) model [Harding, 1971; 



Cavalli-Sforza & Edwardsl. Il967l] : starting from 



a single ancestral species, one among the tree 
leaves existing at the present time is chosen 
at random, bifurcating into two new leaves. 
Then this operation is repeated for a number 
of time steps or, equivalently, until the tree 
reaches a desired size. The topological char- 
acteristics of the constructed trees are surpris- 



Figure 1: Mean depth (d) of trees in TreeBASE 
(circles) as a function of number of leaves n. 
Squares are obtained from computer simula- 
tions of the ERM model, behaving as Eq. ^ for 
large n. At large sizes, the depth in the real phy- 
logenetic trees scales with the number of leaves 
faster than the ERM behavior. For both real 
phylogenies and model, depth values for each 
tree size are obtained by logarithmic binning of 
the depth of all trees and subtrees with that size. 



ingly robust, being shared by apparently differ- 
ent models su ch as the coalescent and others 
l AldousL 2001 ]. Essentially what is needed is 
that different branches at a given time branch 
independently and with the same probabilities. 
When extinction is taken into account, the same 
topology is recovered when considering only the 
lineages surviving at the final time. One of 
the characteristics of this type of branching 
is a distribution of subtree sizes A scaling at 
large sizes as A~'^, an outcome robustly ob- 
served in many natural and artificial systems 
and in classification schemes , including tax- 
onomies [ Burlando. 1990; Cald arelH et al.Ll2004l: 
Capocci et al.L 120081) . Another important char- 
acteristic is that the mean depth of the tree {d) 
(i.e. the average distance, measured in number 
of links, from the leaves to the root) scales loga- 
rithmically with the number of leaves n: 



(d) ~ log n . 



(1) 



It is worth noting that these results apply not 
only to many random branching models, but also 
to the simple deterministic Cayley tree, in which 
all internal nodes at a given level split in a fixed 
number of daughter nodes. 

In view of this generality it was surprising 
to find that the topology of observed phyloge- 
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Figure 2: Examples of trees with 32 leaves, gen- 
erated from several models, a) Tree generated 
with the ERM model, which is equivalent to the 
alpha model with a = 0. b) The completely un- 
balanced tree, which is equivalent to the alpha 
model with a = 1. c) A tree generated with the 
alpha model for a = 0.5. d) A tree generated 
with the activity model for p = 0.5. The trees 
in c) and d) display an imbalance intermediate 
between a) and b). 



nies does not agree wi th any of these predictions 
I Herrada et al. . I 2OO8I1 . In fact, it was known 
since some time ago that real phylogenies are 
substantially more unbalanced than predicted 
by the ERM and similar models [Aldous, 200l|; 
Blum & Francoi I I2OO6I1 . This means that some 
lineages diversify much more than others, in 
a way that is statistically incompatible with 
the ERM predi ctions. Figure [1] compares data 
iHerrada et al.L [20081 compiled from TreeBASE, 
a public repository containing several thousands 
of empirical phylogenetic trees corresponding to 
virtually all kinds of organisms in Earth, with 
the predictions of the ERM model. For the phy- 
logenetic trees at large sizes the mean depth 
scales with the number of leaves faster than the 
ERM behavior in Eq.©. 

The breakdown of the ERM behavior indicates 
that evolutionary branching should present cor- 
relations either in time or between the dif- 
ferent branches. Mechanisms producing trees 
with non-ERM scaling for the depth have been 
identified, as for example th e situation of 
critical branching toe Los RiosL I2OOII: iHarrisL 
1963] or optimizatio n of transport processes 
I Banavar et al.L Il999n . In the phylogenetic con- 



text models of this type have been proposed 
fAldous, 2001; ' Pinehsi (20031: iBlum & Francoisl. 
[2006giFord|,|200S" although most of them lack a 
clear interpretation in biological terms. 

In the following we present results for two 
branching models showing asymptotically non- 
ERM, i.e. non-logarithmic, scaling for the depth. 
Their study is motivated, on the one hand, by 
the empirical results above from real phyloge- 
netic trees. On the other, they pertain to the 
small set of available models with non-ERM 
scaling which are defined dynamically (i.e. by a 
set of rules that are applied to the present state 
of a growing tree to find the state at the next 
time step) rather than being characterized glob- 
ally by statistical or optimization prescriptions. 
The first model we present. Ford's alpha model, 
is a simple example for which the non-trivial 
asymptotic scaling (of the power law type) has 
been analytically identified. We analyze it nu- 
merically to confirm this prediction and to dis- 
play the behavior at finite sizes. We introduce 
later a new model, the activity model, which also 
leads to non-logarithmic depth scaling at a crit- 
ical parameter value. 

2 Ford's alpha model 



FordI ||2006[] introduced a model for recursive 
tree formation: At a given step in the process 
the tree is a set of leaves connected by terminal 
links to internal nodes, which are themselves 
connected by internal edges until reaching the 
root (the root itself is considered to have a single 
edge, which we count as internal, joining to the 
first bifurcating internal node; with this conven- 
tion a tree of n leaves has n — 1 internal edges). 
Then, a probability of branching proportional to 
1 - Q is assigned to each leaf, and proportional to 
a to each internal edge. By normalization these 
probabilities are, respectively, (1 — a)/(n — a), 
and a/(n— a). When a leaf is selected for branch- 
ing, it gives birth to a couple of new ones, as in 
the ERM model. But when choosing an internal 
edge, a new leaf branches from it by the inser- 
tion in the edge of a new internal node. For a = 
we have the standard ERM model. For a = 1 the 
completely unbalanced comb tree, in which all 
leaves branch successively from a main branch, 
is generated. Intermediate topologies are ob- 
tained for a G (0,1). Figure m shows examples 
of trees generated for different values of a. 



By considering the effect of the addition of 
new leaves o n the distances between root and 
other nodes, iFordI ll2006 ^ derived exact recur- 
rence relationships which, when written in 
terms of the average depth, lead to: 



n 



n — a 



{d)n + 



2n{l - 2a) 
{n + l)(n — a) 



(2) 



is the mean depth of the leaves of a tree 
with n leaves. By assuming a behavior ~ n" 
at large n, and expanding Eq. (H)) in powers of 
l/n, we get v = a,so that 



~ n 



, if < a < 1 . 



(3) 



If a = the standard ERM behavior, Eq. ([T]), is 
recovered. 

Figure [3] shows numerical results for the 
depth of trees generated with this model. Note 
that the predicted asymptotic behavior is at- 
tained but only at very large tree sizes, in gen- 
eral sizes much larger than the tree sizes of the 
examples shown in Fig. [2] and of the available 
empirical phylogenies. As analytically demon- 
strated [Ford. 20061 depth statistics of subtrees 
of given size extracted from a large tree behave 
as data from trees of that size directly generated 
by the alpha model algorithm. 

While the Ford model gives a simple mecha- 
nism for scaling in trees with a tunable expo- 
nent, the dynamical rule of posterior insertion 
of inner nodes is hard to justify in the context of 
evolution (although one can think on the mod- 
elling of errors arising in phylogenetic recon- 
struction methods when incorrectly assigning 
a splitting to a non-existing ancestral species). 
This motivates the introduction of a new model 
described in the next section. 



3 Activity model 

In this section we show that tree shapes distinct 
from the ERM model may also result from a 
memory in terms of internal states of the nodes. 
The activity model proposed here is conceptu- 
all y similar to t he class of models suggested 
by IPinelia ll2003l] . However, the present model 
distinguishes only between active and inactive 
nodes and has a single parameter controlling 
the spread of activity. 

Starting from a single node (the root), a bi- 
nary tree is generated as follows. At each step, 
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Figure 3: Depth statistics vs tree size for the alpha model. Symbols indicate the mean depth of leaves 
from root, averaged over the 100 trees generated for each size (2^^, = 3, 4, 15), and the error bars 
are the corresponding standard deviations. The points in the rugged lines come from each subtree 
of all trees generated. The dashed segments indicate the analytic predictions [Ford. ,2006.] for the 
scaling at large n. The inset highlights the logarithmic scaling of the a = case. 



a leaf i of the tree is chosen and branched into 
tw^o new leaves. Each of the tw^o new^ leaves, in- 
dependently of the other, is set active with prob- 
ability p or inactive v^^ith probability I — p. The 
branching leaf i is chosen at random from the 
set of active leaves if this set is non-empty. Oth- 
erwise, i is chosen at random from the set of 
all leaves. Figure |4] shows that for p = 1/2 the 
model generates trees with mean depth growing 
as the square root of tree size (note the log-log 
scale). Figure [2] displays a small-size example of 
such trees. For values of p below or above 1/2, 
(d) seems to increase logarithmically with n. 

Here we give a simplified argument to under- 
stand the observed exponent 1/2 of the distance 
scaling with system size in the case p = 1/2. At 
the time the growing tree has n leaves in total, 
let Da{n) be the expected sum of distances of ac- 
tive leaves from the root, and Dfy{n) the anal- 
ogous quantity for the inactive leaves. When a 
randomly chosen active leaf -at distance da from 
root- branches, the expected increase of Da{n) is 

^Da{n) = Da{n + l)-Da{n) = 

pHda + 2) + 2p{l-p)-l + il-pf{-da) 



= {2p-l)da + 2p. (4) 

Here the three terms of the second line are for 
the activation of two, one and zero of the new 
leaves, respectively. This expression is appro- 
priate as far as the number of active nodes is 
not zero. Simultaneously, the expected change 
in Dh{n) during the same event is 

ADb{n) = 

/•O + 2pil-p){da + l) + {l-pf2ida + l) 
= 2(1 - p){da + 1) . (5) 

We now average ADa{n) over the different 
choices of the particular active leave that has 
been branched. This amounts to replacing da in 
the above formulae by the average depth 

of the active leaves in a tree of n leaves. Writ- 
ing Di{n + 1) = Di{n) + ADi{n), for i = a,b, 
one would get a closed system for the quanti- 
ties Di{n) provided (da)„ is expressed in terms 
of them. This can be done by writing (da)„ = 
Da{n) /a{n), where a{n) is the expected number 
of active leaves in a tree of n leaves. This ex- 
pected value is used here as an approximation 
to the actual number of active leaves. 




Figure 4: Average depth versus size for the activity model for various values of the activation prob- 
ability p. Data points displayed by symbols give the average distance of leaves with respect to the 
root. Error bars give the standard deviation taken over different realizations (1000 trees per data 
point). Data in the rugged curves are for all subtrees of trees with size 2^^ = 2097152. The dashed 
line represents a power law scaling with exponent 1/2, corresponding to the scaling of the p = 0.5 
curve, as discussed in the text. 



The recurrence equations for Di{n) are spe- 
cially simple in the most interesting case p = 
1/2, since the dependence in disappears 
from one of the equations: 



Dain + 1) 
Db{n + 1) 



Dain) + 1 

Dbin) + {da)n + 1 



The solution (with initial condition Da{l) 
Eq. (HI) is simply: 



Dain) 



n 



1 . 



(6) 
(7) 

0) of 
(8) 



Since the probabilities of an increment or decre- 
ment (by one unit) of the number of active leaves 
are the same and time-independent for p = 1/2, 
the number of active nodes performs a symmet- 
ric random walk with a reflecting boundary at 
(this last condition arises from the prescrip- 
tion of setting active one node when the number 
of active nodes has reached zero in the previous 
step). For such random walk the expected value 
of active leaves a(n) increases as the square root 
of the number of steps. Since a new leaf is added 
at each time step, this leads to: 



a(n) 



n 



1/2 



(9) 



Combining ^ and ^ we obtain the average 
distance of active nodes from root at large tree 
sizes: 



Dain) 



an 



~ n 



1/2 



(10) 



Now we can plug this result into Eq. 
which can be solved recursively: 



n-l 



n-1 



Dbin)=Dbil) + J2i{da)t + l)^Y.t 



1/2 ^ „3/2 



t=l 



t=l 



(11) 

The totally averaged depth {d)^, which counts 
both the active and the inactive leaves, is 



Da{n)+Db{n) 



" n n 

(12) 

which explains the asymptotic behavior ob- 
served in Fig. IHfor p = 1/2. 

We note that the growth dynamics presented 
here rn a y be mapped to a branching process 
iHarrisLll963[] . with the difference that here the 
death (inactivation) of a node does not lead to its 
removal from the tree. The special case p = 1/2 
corresponds to a critical branching process. 



1/2 ^ „3/2 



1/2 



4 Discussion 

We have presented and studied two simple 
models which lead to non-logarithmic scal- 
ing of the tree depth. In contrast with 
many of the available mode ls having this 
behavior llBanavar et aP, 'l999'; 'Aldous',l20oi 



Blum & F rancois. 2006; Ford, 2006] they are for- 
mulated as dynamical models involving growing 
trees, so that rules are given to obtain the tree at 
the next time step from the present state. Their 
study has been motivated by data from phylo- 
genetic branching, and they are interesting ad- 
ditions to our present understanding of complex 
networks and trees. 

A recent analysis of several evolution- 
ary models including species competition 



llStich & ManrubiaL 1200811 indicates that in 



these models correlations are finally destroyed 
by mutation processes and persist only for a 
finite correlation time. Thus sufficiently large 
trees would have a scaling behavior closer to the 
asymptotic ERM predictions. Since the largest 
phylogenies in databases such as TreeBASE 
have only some hundreds of leaves, it is possible 
that the observed imbalance and depth scaling 
is a finite-size regime. Nevertheless models 
going beyond the ERM scaling are needed at 
least to explain this finite-size regime, and 
also to elucidate the true asymptotic scaling 
behavior. Here, we have also observed large 
finite-size transients in the alpha model of Sect. 
2. 

The different types of scaling of depth with 
size can be interpreted as indicating different 
values of the (fractal) dimensionality of the 
trees. This is so because (d) is a measure of the 
diameter of the tree, and because for a binary 
tree the total number of nodes is simply twice 
the number of leaves. Since the simpl est defini- 
tion o f dimension D oi a network [Egul luz et al. . 
200311 is given by the growth of the number 
of nodes as the diameter increases, n ~ {d)^, 
power law scaling of the type (d) ~ n'^ indi- 
cates that the tree can be thought as having a 
dimension D = l/v. The logarithmic scaling 
in the ERM model is an example of the small- 
world behavior common t o many network struc- 
tures [ Albert & Barabasil l20021. which is equiv- 
alent to having an effective infinite dimension- 
ality, whereas the power law scaling reveals a 
finite dimension for the tree, which implies a 
more constrained mode of branching. The al- 



pha model produces trees with tunable dimen- 
sion from 1 to oo, and the critical activity model 
gives two-dimensional trees. 

The final aim of the modelling of phylogenetic 
trees is to provide biological mechanisms ex- 
plaining the branching topology of the Tree of 
Life. In this direction, the branching of inter- 
nal edges in the Ford model has no obvious bi- 
ological interpretation. The activity model puts 
the mechanisms of birth-death critical branch- 



ing llHarrisl. 11963 1 within a framework of tran- 
sitions between node inte rnal s tates similar in 
spirit to the approach of IPinel is [2003]. The 



need to tune a parameter to attain the non-ERM 
critical behavior is however a limitation for its 
applicability. Much additional work is needed 
to identify the proper biological mechanisms be- 
hind evolutionary branching and adequate mod- 
elling of them. 
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